The goal of this project is to analyze the complex relationships between economic and population growth, sustainable energy practices, and energy consumption.
Author: data detectives - Ayesha, Abhishek, Sheemithra, Toluwanimi, Valerie, Alyssa
Affiliation: School of Information, University of Arizona
Abstract
Add project abstract here.
Question 1
Is it possible to predict a nation’s power consumption from its population size, its gross domestic product (GDP), the percentage of electricity generated from renewable sources, and how these change across the years?
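Mean squared error on the held-out 20% test split, first for the three simple linear regressions (population, gdp, and renewables_electricity, in that order) and then for the multiple regression. All values are on the min-max normalized scale of primary_energy_consumption.
Mean Squared Error: 0.000261923447842426
Mean Squared Error: 2.937309271249316e-05
Mean Squared Error: 0.0001313336189156295
Mean Squared Error (Multiple Regression): 3.203271840932969e-05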
Model: Population vs Energy Consumption
R-squared: 0.3531414344090452
RMSE: 0.016184049179436708
------------------
Model: Gdp vs Energy Consumption
R-squared: 0.9306708673762983
RMSE: 0.0054196948910887185
------------------
Model: Renewables_electricity vs Energy Consumption
R-squared: 0.6549275418223406
RMSE: 0.011460088084985625
------------------
Multiple Regression Model:
R-squared: 0.9273399797855412
RMSE: 0.005659745436795695
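The source computes R-squared with `sm.OLS(true, pred).fit().rsquared`, which regresses the held-out targets on the predictions without an intercept. In that setting statsmodels reports the uncentered R-squared, which can differ considerably from the conventional (centered) coefficient of determination. The sketch below uses plain Python to contrast the two definitions and to show that each reported RMSE is the square root of the corresponding MSE. The `y_true` and `y_pred` lists and the helper names are hypothetical illustrations, not project data or project code.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: the square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

def centered_r2(y_true, y_pred):
    """Conventional R-squared: 1 - SS_res / SS_tot around the mean of y_true."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

def uncentered_r2(y_true, y_pred):
    """R-squared of a no-intercept least-squares fit of y_true on y_pred.

    With slope b = sum(t*p) / sum(p*p), the value 1 - SSR / sum(t*t)
    simplifies to sum(t*p)**2 / (sum(p*p) * sum(t*t)).
    """
    s_tp = sum(t * p for t, p in zip(y_true, y_pred))
    s_pp = sum(p * p for p in y_pred)
    s_tt = sum(t * t for t in y_true)
    return s_tp ** 2 / (s_pp * s_tt)

# Hypothetical values: a model that always predicts the same number.
y_true = [0.010, 0.012, 0.011, 0.013]
y_pred = [0.012, 0.012, 0.012, 0.012]

print("centered R^2:  ", centered_r2(y_true, y_pred))
print("uncentered R^2:", uncentered_r2(y_true, y_pred))

# Reported values for the population model, copied from the output above.
reported_mse = 0.000261923447842426
reported_rmse = 0.016184049179436708
print("sqrt(MSE) matches RMSE:", math.isclose(math.sqrt(reported_mse), reported_rmse, rel_tol=1e-9))
```

In this toy case a constant prediction scores below zero on the centered definition but close to one on the uncentered one, so it may be worth confirming which definition a reported R-squared uses before comparing models.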
Question 2
Which countries or regions are engaging in sustainable energy practices and relying more on renewable than nonrenewable energy? Which countries are on a trajectory toward relying more on renewable energy and producing fewer greenhouse gas emissions?
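One possible way to approach the trajectory part of this question is to compute each country's renewable share of electricity generation (the source uses `renewables_electricity / electricity_generation`) and compare the share in the first and last available year. The following is a minimal sketch in plain Python. The `rows` records, the country names, and the `share_change` helper are hypothetical illustrations, not values or code from the project.

```python
from collections import defaultdict

# Hypothetical records shaped like rows of the energy dataset.
rows = [
    {"country": "A", "year": 2000, "renewables_electricity": 10.0, "electricity_generation": 100.0},
    {"country": "A", "year": 2020, "renewables_electricity": 45.0, "electricity_generation": 150.0},
    {"country": "B", "year": 2000, "renewables_electricity": 50.0, "electricity_generation": 100.0},
    {"country": "B", "year": 2020, "renewables_electricity": 40.0, "electricity_generation": 100.0},
]

def renewables_share(row):
    """Fraction (0 to 1) of electricity generated from renewable sources."""
    return row["renewables_electricity"] / row["electricity_generation"]

def share_change(rows):
    """Return {country: last-year share minus first-year share}."""
    by_country = defaultdict(list)
    for row in rows:
        if row["electricity_generation"] > 0:  # skip rows that would divide by zero
            by_country[row["country"]].append((row["year"], renewables_share(row)))
    changes = {}
    for country, series in by_country.items():
        series.sort()  # order by year
        changes[country] = series[-1][1] - series[0][1]
    return changes

changes = share_change(rows)
# Countries with the largest increase in renewable share come first.
for country, delta in sorted(changes.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{country}: {delta:+.2f}")
```

A positive change suggests a country is moving toward renewables. A first-versus-last comparison ignores what happens in between, so a fitted trend over all years may be a better summary for noisy series.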
Repo Organization
The project repository contains the following folders and files:
.github/: GitHub-related files, including workflows, actions, and issue templates.
_extra/: Miscellaneous supplementary files that do not fit into other project categories.
_freeze/: Frozen environment files that record the project’s environment configuration and dependencies.
data/: Data files used by the project, including input files and datasets.
images/: Visual assets used in the project documentation and presentation, such as diagrams, charts, and screenshots.
.gitignore: Specifies the files and directories that Git should leave untracked.
README.md: The main entry point for project information: setup, usage instructions, and an overview of the project’s objectives and scope.
_quarto.yml: The Quarto configuration file, with settings that control how the Quarto documents are built and rendered.
about.qmd: A Quarto Markdown file with additional context on the project’s purpose, its contributors, and other project details.
index.qmd: The main documentation page for the project. This Quarto Markdown file provides detailed descriptions of the project, including all code and visualizations.
Source Code
---
title: "Global Energy Trends"
subtitle: "INFO 523 - Project Final"
author:
  - name: "data detectives - Ayesha, Abhishek, Sheemithra, Toluwanimi, Valerie, Alyssa"
    affiliations:
      - name: "School of Information, University of Arizona"
description: "The goal of this project is to analyze the complex relationships between economic and population growth, sustainable energy practices, and energy consumption."
format:
  html:
    code-tools: true
    code-overflow: wrap
    embed-resources: true
editor: visual
execute:
  warning: false
  echo: false
jupyter: python3
---

## Abstract

Add project abstract here.

# Question 1

Is it possible to predict a nation's power consumption from its population size, its gross domestic product (GDP), the percentage of electricity generated from renewable sources, and how these change across the years?

```{python}
#| label: question
#| echo: false

# cell library
import pandas as pd

# Load the dataset
data = pd.read_csv('data/owid-energy-data.csv')

# columns to keep
keep = ['year', 'population', 'gdp', 'electricity_generation', 'primary_energy_consumption', 'renewables_electricity']

# data for q1
q1_data = data[keep]

# Drop rows with any empty values
q1_data_cleaned = q1_data.dropna()

# Save the cleaned dataset to a new CSV file
q1_data_cleaned.to_csv('data/q1_energy_data_cleaned.csv', index=False)
```

```{python}
#| label: question1
#| echo: false

# Load the dataset
data = pd.read_csv('data/q1_energy_data_cleaned.csv')

# Calculate the percentage of electricity generated from renewable sources
data['renewables_percentage'] = (data['renewables_electricity'] / data['electricity_generation']) * 100

# Standardize or normalize the relevant columns (if needed)
# Since the columns are already of type 'double', normalization might be beneficial
# Here's an example of Min-Max normalization
# You can choose to apply normalization to specific columns as needed
columns_to_normalize = ['electricity_generation', 'primary_energy_consumption', 'renewables_electricity']
for column in columns_to_normalize:
    data[column] = (data[column] - data[column].min()) / (data[column].max() - data[column].min())

# Save the updated dataset to a new CSV file
data.to_csv('data/q1_energy_data_processed.csv', index=False)
```

```{python}
#| label: question2
#| echo: false

import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = pd.read_csv('data/q1_energy_data_processed.csv')

plt.figure(figsize=(10, 8))
sns.histplot(data['primary_energy_consumption'], kde=True)
plt.title('Distribution of Target Variable "primary_energy_consumption"')
plt.xlabel('Primary Energy Consumption')
plt.ylabel('Frequency')
plt.xlim(0, 0.065)  # Set x-axis limits
plt.ylim(0, 90)  # Set y-axis limits
plt.show()
```

```{python}
#| label: question3
#| echo: false

import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset
data = pd.read_csv('data/q1_energy_data_processed.csv')

# Plotting pairplot of relevant columns
sns.pairplot(data[['population', 'gdp', 'renewables_electricity', 'primary_energy_consumption']])
plt.title('Pairplot of Features and Target Variable')
plt.show()

# Plotting correlation matrix
plt.figure(figsize=(10, 8))
sns.heatmap(data[['population', 'gdp', 'renewables_electricity', 'primary_energy_consumption']].corr(), annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix')
plt.show()

# Plotting scatter plots of features against target variable
plt.figure(figsize=(15, 10))

# Loop through each feature and plot against the target variable
for i, column in enumerate(['population', 'gdp', 'renewables_electricity']):
    plt.subplot(2, 2, i+1)
    sns.scatterplot(x=column, y='primary_energy_consumption', data=data)
    plt.title(f'Scatter plot of {column} vs Primary Energy Consumption')
    plt.xlabel(column)
    plt.ylabel('Primary Energy Consumption')

plt.tight_layout()
plt.show()
```

```{python}
#| label: question4
#| echo: false
#| warning: false

import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.tsa.seasonal import seasonal_decompose
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Load the dataset
q1_energy_data = pd.read_csv('data/q1_energy_data_processed.csv')

# Time Series Analysis
# Decompose time series data to analyze trends, seasonality, and residuals
decomposition = seasonal_decompose(q1_energy_data['primary_energy_consumption'], model='additive', period=1)
decomposition.plot()
plt.title('Time Series Decomposition')
plt.show()

# Regression Analysis
# Function to perform linear regression and plot the results
def perform_linear_regression(x, y, x_label):
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(x_train.values.reshape(-1, 1), y_train)
    y_pred = model.predict(x_test.values.reshape(-1, 1))
    mse = mean_squared_error(y_test, y_pred)
    plt.scatter(x_test, y_test, color='blue', label='Actual')
    plt.plot(x_test, y_pred, color='red', label='Predicted')
    plt.title(f'Simple Linear Regression: {x_label} vs Primary Energy Consumption')
    plt.xlabel(x_label)
    plt.ylabel('Primary Energy Consumption')
    plt.legend()
    plt.show()
    print(f'Mean Squared Error: {mse}')

# Perform linear regression for each independent variable
for column in ['population', 'gdp', 'renewables_electricity']:
    perform_linear_regression(q1_energy_data[column], q1_energy_data['primary_energy_consumption'], column)

# Multiple Regression
# Combine multiple independent variables and perform regression
X = q1_energy_data[['population', 'gdp', 'renewables_electricity']]
y = q1_energy_data['primary_energy_consumption']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error (Multiple Regression): {mse}')

##### Line Graphs for Time Series Data
plt.figure(figsize=(15, 10))

plt.subplot(2, 2, 1)
plt.plot(q1_energy_data['year'], q1_energy_data['primary_energy_consumption'], marker='o', color='blue')
plt.title('Primary Energy Consumption Over Time')
plt.xlabel('Year')
plt.ylabel('Primary Energy Consumption')

plt.subplot(2, 2, 2)
plt.plot(q1_energy_data['year'], q1_energy_data['population'], marker='o', color='green')
plt.title('Population Growth Over Time')
plt.xlabel('Year')
plt.ylabel('Population')

plt.subplot(2, 2, 3)
plt.plot(q1_energy_data['year'], q1_energy_data['gdp'], marker='o', color='red')
plt.title('GDP Over Time')
plt.xlabel('Year')
plt.ylabel('GDP')

plt.subplot(2, 2, 4)
plt.plot(q1_energy_data['year'], q1_energy_data['renewables_electricity'], marker='o', color='orange')
plt.title('Renewable Electricity Over Time')
plt.xlabel('Year')
plt.ylabel('Renewable Electricity')

plt.tight_layout()
plt.show()

# Scatterplots
plt.figure(figsize=(15, 5))

plt.subplot(1, 3, 1)
plt.scatter(q1_energy_data['population'], q1_energy_data['primary_energy_consumption'], color='blue')
plt.title('Population vs Energy Consumption')
plt.xlabel('Population')
plt.ylabel('Primary Energy Consumption')

plt.subplot(1, 3, 2)
plt.scatter(q1_energy_data['gdp'], q1_energy_data['primary_energy_consumption'], color='green')
plt.title('GDP vs Energy Consumption')
plt.xlabel('GDP')
plt.ylabel('Primary Energy Consumption')

plt.subplot(1, 3, 3)
plt.scatter(q1_energy_data['renewables_electricity'], q1_energy_data['primary_energy_consumption'], color='red')
plt.title('Renewable Electricity vs Energy Consumption')
plt.xlabel('Renewable Electricity')
plt.ylabel('Primary Energy Consumption')

plt.tight_layout()
plt.show()

# Model Evaluation
# Metrics for Simple Linear Regression
# R-squared and RMSE
def evaluate_model(true, pred):
    # Note: OLS is fitted here without a constant term, so statsmodels
    # reports the uncentered R-squared of true regressed on pred
    r_squared = sm.OLS(true, pred).fit().rsquared
    rmse = mean_squared_error(true, pred, squared=False)
    return r_squared, rmse

# Evaluate each simple linear regression model
for column in ['population', 'gdp', 'renewables_electricity']:
    X = q1_energy_data[column].values.reshape(-1, 1)
    y = q1_energy_data['primary_energy_consumption']
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
    model = LinearRegression()
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    r_squared, rmse = evaluate_model(y_test, y_pred)
    print(f'Model: {column.capitalize()} vs Energy Consumption')
    print(f'R-squared: {r_squared}')
    print(f'RMSE: {rmse}')
    print('------------------')

# Metrics for Multiple Regression
X = q1_energy_data[['population', 'gdp', 'renewables_electricity']]
y = q1_energy_data['primary_energy_consumption']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r_squared, rmse = evaluate_model(y_test, y_pred)
print('Multiple Regression Model:')
print(f'R-squared: {r_squared}')
print(f'RMSE: {rmse}')
```

# Question 2

Which countries or regions are engaging in sustainable energy practices and relying more on renewable than nonrenewable energy? Which countries are on a trajectory toward relying more on renewable energy and producing fewer greenhouse gas emissions?

```{python}
#| label: MAP1
#| echo: false
#| warning: false

import pandas as pd
import plotly.express as px
import geopandas as gpd

# Load the data
data = pd.read_csv('data/owid-energy-data.csv')

# Load world shapefile for mapping
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Merge world shapefile with energy data
world = world.merge(data, how='left', left_on='iso_a3', right_on='iso_code')

# Ensure data is filtered from 2000 to 2023
world = world[(world['year'] >= 2000) & (world['year'] <= 2023)]

# Calculate the share of renewable energy consumption
world['renewables_share'] = world['renewables_electricity'] / world['electricity_generation']

# Plot the animated map
fig = px.choropleth(world,
                    locations='iso_a3',
                    color='renewables_share',
                    hover_name='name',
                    animation_frame='year',
                    range_color=(0, 1),
                    projection='natural earth',
                    color_continuous_scale=px.colors.sequential.Plasma,
                    title='Share of Renewable Energy Consumption (%)')
fig.show()
```

```{python}
#| label: MAP2
#| echo: false
#| warning: false

import pandas as pd
import plotly.express as px
import geopandas as gpd

# Load the data
data = pd.read_csv('data/owid-energy-data.csv')

# Load world shapefile for mapping
world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))

# Merge world shapefile with energy data
world = world.merge(data, how='left', left_on='iso_a3', right_on='iso_code')

# Ensure data is filtered from 2000 to 2023
world = world[(world['year'] >= 2000) & (world['year'] <= 2023)]

# Calculate the share of renewable energy consumption
world['renewables_share'] = world['renewables_electricity'] / world['electricity_generation']

# Define color palette
colorscale = [
    (0, "purple"),
    (0.5, "white"),
    (1, "green")
]

# Plot the animated map
fig = px.choropleth(world,
                    locations='iso_a3',
                    color='renewables_share',
                    hover_name='name',
                    animation_frame='year',
                    range_color=(0, 1),
                    projection='natural earth',
                    color_continuous_scale=colorscale,
                    title='Share of Renewable Energy Consumption (%)')
fig.show()
```

```{python}
# Create renewable energy dataset as a subset of the data
wind_data = data[['country', 'year', 'wind_consumption']]
solar_data = data[['country', 'year', 'solar_consumption']]
hydro_data = data[['country', 'year', 'hydro_consumption']]

# remove all missing values
wind_data_clean = wind_data.dropna()
solar_data_clean = solar_data.dropna()
hydro_data_clean = hydro_data.dropna()

# Group the data by year and sum each energy type's consumption for each year across all countries
grouped_wind_data = wind_data_clean.groupby('year')['wind_consumption'].sum()
grouped_solar_data = solar_data_clean.groupby('year')['solar_consumption'].sum()
grouped_hydro_data = hydro_data_clean.groupby('year')['hydro_consumption'].sum()

# Plot each energy type with appropriate labels and colors
plt.figure(figsize=(12, 10))
plt.plot(grouped_wind_data.index, grouped_wind_data.values, marker='o', color='skyblue', label='Wind')
plt.plot(grouped_solar_data.index, grouped_solar_data.values, marker='o', color='goldenrod', label='Solar')
plt.plot(grouped_hydro_data.index, grouped_hydro_data.values, marker='o', color='seagreen', label='Hydro')
plt.xlabel('Year')
plt.ylabel('Energy Consumption (in terawatt-hours)')
plt.title('Global Renewable Energy Consumption')
plt.grid(True)
plt.legend()  # Add a legend to differentiate the lines
plt.tight_layout()  # Adjust the plot to ensure everything fits without overlap
plt.show()
```

```{python}
import matplotlib.pyplot as plt
```

```{python}
```

```{python}
import matplotlib.dates as mdates

# Create electricity dataset as a subset of the data
electricity_data = data[['country', 'year', 'greenhouse_gas_emissions']]

# remove all missing values
electricity_data_clean = electricity_data.dropna()

# Convert 'year' to datetime for better handling in matplotlib
electricity_data_clean['year'] = pd.to_datetime(electricity_data_clean['year'], format='%Y')

# Sorting the data
electricity_data_clean.sort_values('year', inplace=True)

# Trend plot
plt.figure(figsize=(10, 6))
plt.fill_between(electricity_data_clean['year'], electricity_data_clean['greenhouse_gas_emissions'], color="skyblue", alpha=0.4)
plt.plot(electricity_data_clean['year'], electricity_data_clean['greenhouse_gas_emissions'], marker='o', color='skyblue')
plt.gcf().autofmt_xdate()  # Automatic rotation of the dates
myFmt = mdates.DateFormatter('%Y')
plt.gca().xaxis.set_major_formatter(myFmt)
plt.xlabel('Year')
plt.ylabel('Emissions from Electricity generation (in megatonnes of CO2 equivalents)')
plt.title('Trend of Global Greenhouse Gas Emissions from Electricity Generation')
plt.grid(True)
plt.tight_layout()
plt.show()
```

# Repo Organization

The project repository contains the following folders and files:

- **.github/:** GitHub-related files, including workflows, actions, and issue templates.
- **\_extra/:** Miscellaneous supplementary files that do not fit into other project categories.
- **\_freeze/:** Frozen environment files that record the project's environment configuration and dependencies.
- **data/:** Data files used by the project, including input files and datasets.
- **images/:** Visual assets used in the project documentation and presentation, such as diagrams, charts, and screenshots.
- **.gitignore:** Specifies the files and directories that Git should leave untracked.
- **README.md:** The main entry point for project information: setup, usage instructions, and an overview of the project's objectives and scope.
- **\_quarto.yml:** The Quarto configuration file, with settings that control how the Quarto documents are built and rendered.
- **about.qmd:** A Quarto Markdown file with additional context on the project's purpose, its contributors, and other project details.
- **index.qmd:** The main documentation page for the project. This Quarto Markdown file provides detailed descriptions of the project, including all code and visualizations.